[ORT 1.17.0 Release] Cherry pick 1st round #19243
Conversation
…19182)

### Description
Extends code coverage to the Entropy, Histogram, and Distribution calibration methods, and fixes bugs found while doing so.

### Motivation and Context
Bugs detected in [Olive](https://github.com/microsoft/OLive).
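A minimal sketch of selecting one of these calibration methods through the static quantization API; the data reader, model paths, and input shape below are placeholders, not part of this PR.

```python
# Sketch: choosing a calibration method for static quantization.
# The random data reader is a stand-in; a real reader yields actual model inputs.
import numpy as np
from onnxruntime.quantization import CalibrationDataReader, CalibrationMethod, quantize_static

class RandomDataReader(CalibrationDataReader):
    def __init__(self, input_name="input", shape=(1, 3, 224, 224), n=8):
        self._data = iter(
            [{input_name: np.random.rand(*shape).astype(np.float32)} for _ in range(n)]
        )

    def get_next(self):
        return next(self._data, None)

quantize_static(
    "model.onnx",
    "model.quant.onnx",
    RandomDataReader(),
    calibrate_method=CalibrationMethod.Entropy,  # Percentile / Distribution also available
)
```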
### Description
Allow the proxy to load models with 1 GB <= size < 2 GB.

Resolves #19157.
…ry (#19174)

### Description
Check the ep_cache_context node property of EPContext nodes, and don't allow relative paths like "../file_path".
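A minimal sketch of the kind of check described; the function name and logic are illustrative only (the real check lives in the EP's C++ code): reject any cached-context path that escapes the context model's directory.

```python
# Illustrative path check: reject absolute paths and anything that resolves
# outside the context model directory, e.g. "../file_path".
from pathlib import Path

def is_allowed_cache_path(context_model_dir: str, ep_cache_context: str) -> bool:
    if Path(ep_cache_context).is_absolute():
        return False
    base = Path(context_model_dir).resolve()
    candidate = (base / ep_cache_context).resolve()
    return candidate.is_relative_to(base)

assert not is_allowed_cache_path("/models/ctx", "../file_path")
assert is_allowed_cache_path("/models/ctx", "engine_dir/model.engine")
```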
### Description
1. Make the JBLAS code an external module of ORT.
2. Move the q4 gemm code to contrib_ops.
3. Update the template kernel library to the v0.1 release.

### Motivation and Context
We found that the current LLM model performance is far below our expectations. Here is some performance data collected on the Mistral-7B model with a Xeon-8480:

| 8 threads | prompt length=32 past_len=32 | prompt length=1 past_len=32 |
| -- | -- | -- |
| ORT-main | 1220ms | 263ms |
| Neural-speed | 564ms | 87ms |
| ORT-this PR | 597ms | 120ms |

Although `Neural-speed` and `ORT-this PR` use the same int4 kernel code, there is a 33ms (87ms vs. 120ms) latency gap between the two frameworks. Through some statistical analysis, the summed latency of `MatMulNBits` is 86.7ms, while the summed latency of all int4 GEMMs in `Neural-speed` is 84.8ms, so other OPs introduce an extra 30ms of latency. The performance of MatMulNBits in this PR meets our expectations.

### Remaining Issues
1. For hybrid CPUs, like the Core 12900K, the ONNXRuntime thread pool uses TaskGranularityFactor to scale its number of threads. This is not expected in our code design and may slow down hybrid CPU performance by 30~40%.
2. Prepack uses a single thread, which makes session initialization very slow.
3. MatMulNBits with zero points will fall through to COMP_FP32 even when accuracy_level=4. Our COMP_INT8 IGemmCore path with zero points is not optimized for now and will be updated in the future. So, for an int4 model with zero points, there is no difference whether accuracy_level is 0 or 4.
### Description
Upgrade package versions.

```
# npm audit report

electron  23.0.0-alpha.1 - 23.3.13
Severity: moderate
ASAR Integrity bypass via filetype confusion in electron - GHSA-7m48-wc93-9g85
fix available via `npm audit fix --force`
Will install [email protected], which is a breaking change
node_modules/electron

get-func-name  <2.0.1
Severity: high
Chaijs/get-func-name vulnerable to ReDoS - GHSA-4q6p-r6v2-jvc5
fix available via `npm audit fix`
node_modules/get-func-name

semver  <=5.7.1 || 6.0.0 - 6.3.0 || 7.0.0 - 7.5.1
Severity: moderate
semver vulnerable to Regular Expression Denial of Service - GHSA-c2qf-rxjj-qqgw
semver vulnerable to Regular Expression Denial of Service - GHSA-c2qf-rxjj-qqgw
semver vulnerable to Regular Expression Denial of Service - GHSA-c2qf-rxjj-qqgw
fix available via `npm audit fix`
node_modules/cross-spawn/node_modules/semver
node_modules/global-agent/node_modules/semver
node_modules/semver
```
### Description
This PR updates the LLaMA-2 attention fusions by adding the following:
- Loading the PyTorch model from Hugging Face with the `LlamaAttention` class before exporting
- Updating the attention mask pattern matching to support another case

This PR also fixes [this issue](#19040).

### Motivation and Context
Recent changes to Hugging Face's `transformers` library break the existing pattern matching. Since the attention fusions aim to change the graph from `LayerNorm Op --> Set of Attention Nodes --> LayerNorm Op` to `LayerNorm Op --> Attention Op --> LayerNorm Op` per layer, it ultimately does not matter which nodes comprise the `Set of Attention Nodes`, because they will all be removed and replaced by the `Attention Op` in the end. Therefore, it does not matter whether the `LlamaAttention` class or a different attention class is used to load the PyTorch model before exporting, because the expected graphs after the attention fusions will look identical no matter which attention class is chosen. By loading the PyTorch model with the `LlamaAttention` class instead of other attention classes (e.g. `LlamaFlashAttention2` or `LlamaSdpaAttention`) and then exporting it to ONNX, the existing pattern matching will continue to work.
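A minimal sketch of loading the model so that the eager `LlamaAttention` class is used before export; it assumes a `transformers` version where `from_pretrained` accepts `attn_implementation`, and the model name is a placeholder.

```python
# Sketch: force the eager LlamaAttention implementation before ONNX export.
# "meta-llama/Llama-2-7b-hf" is illustrative only.
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    torch_dtype=torch.float32,
    attn_implementation="eager",  # selects LlamaAttention rather than SDPA/FlashAttention2
)
model.eval()
# The actual export is done by the ORT LLaMA tooling; the point here is only
# that the model is loaded with the eager attention class before exporting.
```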
… is not guaranteed (#19195)

Fix the issue that the generated context cache model's inputs/outputs order is not guaranteed.

### Description
Currently, QNN EP generates the context cache model in the Compile() method, which only has access to the partitioned graph, and the inputs/outputs order of the partitioned graph is not guaranteed. The EP also does not have a view of the user's input model. We have to move the context cache model generation to a higher level, in GraphPartitioner, which has a view of the partitioned model. This is also a breakdown of the PR for multi-partition support. #18865
…der options (#19154)

Several changes:
1. To align with other EPs' setting of EP context configs in session options, for example [QNN EP](#18877), EP context configs for TRT EP can be configured through:
   1. Session options: `ep.context_enable`, `ep.context_file_path` and `ep.context_embed_mode`
   2. Provider options: `trt_dump_ep_context_model`, `trt_ep_context_file_path` and `trt_dump_ep_context_embed_mode`
   3. The above settings have a 1:1 mapping, and provider options have higher priority over session options.

   ```
   Please note that there are rules for using the following context model related provider options:
   1. In the case of dumping the context model and loading the context model, for security reasons, TRT EP doesn't allow the "ep_cache_context" node attribute of the EP context node to be an absolute path or a relative path that is outside of the context model directory. This means the engine cache needs to be in the same directory or a sub-directory of the context model.
   2. In the case of dumping the context model, the engine cache path will be changed to a path relative to the context model directory. For example, if "trt_dump_ep_context_model" and "trt_engine_cache_enable" are enabled and "trt_ep_context_file_path" is "./context_model_dir":
      - if "trt_engine_cache_path" is "" -> the engine cache will be saved to "./context_model_dir"
      - if "trt_engine_cache_path" is "engine_dir" -> the engine cache will be saved to "./context_model_dir/engine_dir"
   ```
2. The user can decide the naming of the dumped "EP context" model by using `trt_ep_context_file_path`; please see GetCtxModelPath() for more details.
3. Added suggested comments from #18217
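A minimal sketch of setting these configs from the Python API; the option names come from this PR, while the model path and the example values are placeholders.

```python
# Sketch: configuring EP context dumping for TRT EP.
# "model.onnx" and the directory/values below are illustrative only.
import onnxruntime as ort

so = ort.SessionOptions()
# Session-option style (shared with other EPs such as QNN EP):
so.add_session_config_entry("ep.context_enable", "1")
so.add_session_config_entry("ep.context_file_path", "./context_model_dir/model_ctx.onnx")
so.add_session_config_entry("ep.context_embed_mode", "0")

# Provider-option style (TRT-specific; takes priority over the session options above):
trt_options = {
    "trt_engine_cache_enable": "1",
    "trt_dump_ep_context_model": "1",
    "trt_ep_context_file_path": "./context_model_dir",
    "trt_dump_ep_context_embed_mode": "0",
}

session = ort.InferenceSession(
    "model.onnx",
    sess_options=so,
    providers=[("TensorrtExecutionProvider", trt_options), "CPUExecutionProvider"],
)
```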
### Description
1. Support the causal mask in MHA CPU.
2. Support a custom rotary_dim in rotary_emb.
3. Add bf16 support for rotary_emb.
4. Fix a bug in the attention rotary path.
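A minimal numpy sketch of what a custom `rotary_dim` means for rotary embeddings; the function, shapes, and interleaving convention are illustrative, not the kernel's actual code: only the first `rotary_dim` channels of each head are rotated, the rest pass through unchanged.

```python
# Illustrative partial rotary embedding: rotate only the first rotary_dim channels.
import numpy as np

def partial_rotary(x, positions, rotary_dim, base=10000.0):
    # x: (seq_len, num_heads, head_size)
    rot, rest = x[..., :rotary_dim], x[..., rotary_dim:]
    half = rotary_dim // 2
    inv_freq = 1.0 / (base ** (np.arange(half) / half))      # (half,)
    angles = positions[:, None] * inv_freq[None, :]          # (seq_len, half)
    cos, sin = np.cos(angles)[:, None, :], np.sin(angles)[:, None, :]
    x1, x2 = rot[..., :half], rot[..., half:]
    rotated = np.concatenate([x1 * cos - x2 * sin, x1 * sin + x2 * cos], axis=-1)
    return np.concatenate([rotated, rest], axis=-1)

x = np.random.randn(4, 8, 64).astype(np.float32)
out = partial_rotary(x, positions=np.arange(4), rotary_dim=32)
print(out.shape)  # (4, 8, 64)
```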
### Description
Adds the following session options to configure the device:
- `soc_model`: The SoC model number. Refer to the QNN SDK documentation for valid values. Defaults to "0" (unknown).
- `htp_arch`: The minimum HTP architecture the driver will use to select compatible QNN operators.
- `device_id`: The ID of the device to use when setting 'htp_arch'. Defaults to "0" (for single device).

### Motivation and Context
Allow more configuration.
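A minimal sketch of passing these options when creating a session with QNN EP from Python; the option names come from this PR, while `backend_path`, the example values, and passing them as provider options are assumptions of the sketch.

```python
# Sketch: configuring the QNN device options.
# The values "60", "73", "0" and the model path are placeholders.
import onnxruntime as ort

qnn_options = {
    "backend_path": "QnnHtp.dll",  # HTP backend library (assumed)
    "soc_model": "60",             # SoC model number; see the QNN SDK docs for valid values
    "htp_arch": "73",              # minimum HTP architecture to target
    "device_id": "0",              # device to use when htp_arch is set
}

session = ort.InferenceSession(
    "model.onnx",
    providers=[("QNNExecutionProvider", qnn_options), "CPUExecutionProvider"],
)
```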
…oat16 (#17031)

### Description
This PR adds the SbgemmKernel for aarch64. This includes the Sbgemm kernel, which implements matrix multiplication with bfloat16 SIMD instructions (bfmmla), and MatMul operator changes to invoke the Sbgemm kernel. To enable the Sbgemm kernel, set the following session option: "kOrtSessionOptionsGemmFastMathMode". The PR also adds new test cases for mlas and ort.

### Motivation and Context
This is to improve MatMul performance on the aarch64 platform. I have run the benchmarking script below (bert, roberta and gpt2 model inference) on an AWS Graviton3 based c7g.4xl instance and observed a 1.2x - 1.76x performance improvement compared to the sgemm (fp32) kernel performance.
```
cd onnxruntime/python/tools/transformers
python3 benchmark.py
```
And the unit test precision results match the sgemm kernel results.

`./build.sh --config RelWithDebInfo --build_shared_lib --parallel --compile_no_warning_as_error --skip_submodule_sync`
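A minimal sketch of enabling this fast-math mode from Python. The exact config key string behind `kOrtSessionOptionsGemmFastMathMode` is an assumption here and should be verified against the constant's definition in the ORT session-options config key header; the model path is a placeholder.

```python
# Sketch: enabling the bfloat16 fast-math GEMM path on aarch64.
# The config key string is assumed from kOrtSessionOptionsGemmFastMathMode;
# verify it against onnxruntime_session_options_config_keys.h.
import onnxruntime as ort

so = ort.SessionOptions()
so.add_session_config_entry("mlas.enable_gemm_fastmath_arm64_bfloat16", "1")

session = ort.InferenceSession("model.onnx", sess_options=so,
                               providers=["CPUExecutionProvider"])
```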
### Description
Adds a job to create a nightly python package for ORT/QNN on Windows ARM64. Must build onnxruntime-qnn with python 3.11 and numpy 1.25.

**Note: pipeline run may take up to 3 hrs**

### Motivation and Context
Make it possible to get a nightly python package with the latest updates to QNN EP. Issue #19161
### Description
Update unet fusion for the [stable diffusion webui extension](https://github.com/tianleiwu/Stable-Diffusion-WebUI-OnnxRuntime):
(1) Update the fusion pattern to support fp16 unet models.
(2) Add a progress bar.
(3) Use a cached map to speed up dtype or shape lookups in the shape inference result.
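A minimal sketch of the caching idea in item (3); the class and method names are illustrative, not the fusion code's actual API: build the name-to-value_info map once instead of scanning the shape-inference result on every lookup.

```python
# Illustrative memoized lookup over an ONNX shape-inference result.
# "unet.onnx" and "some_tensor" are placeholders.
import onnx

class ShapeInfoCache:
    def __init__(self, inferred_model: onnx.ModelProto):
        graph = inferred_model.graph
        # One pass to index every value_info/input/output by tensor name.
        self._by_name = {
            vi.name: vi
            for vi in list(graph.value_info) + list(graph.input) + list(graph.output)
        }

    def dtype(self, name: str):
        vi = self._by_name.get(name)
        return vi.type.tensor_type.elem_type if vi is not None else None

    def shape(self, name: str):
        vi = self._by_name.get(name)
        if vi is None:
            return None
        return [d.dim_value if d.HasField("dim_value") else d.dim_param
                for d in vi.type.tensor_type.shape.dim]

model = onnx.shape_inference.infer_shapes(onnx.load("unet.onnx"))
cache = ShapeInfoCache(model)
print(cache.dtype("some_tensor"), cache.shape("some_tensor"))
```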
### Description
Add BuildArch.

To verify: https://aiinfra.visualstudio.com/Lotus/_build/results?buildId=400952&view=logs&j=5b022bb4-70a7-5401-8766-a8a7802c7150&t=291e85c7-5547-590b-50de-4e01fcd4eba3&l=14
I suggest holding off on it for a moment, since this PR will add a new dependency, neural-speed, to ONNX Runtime, and I have some concerns about it. I've just sent an email to the author who added this component, and another email to a few PMs to discuss it. I do not have any objection to adding the dependency, but there are some details that need to be figured out. Please give me a few days to complete the work.
ok, noticed.
### Description
Remove old python files.

### Motivation and Context
We have a new op, MatMulNBits, and this one is deprecated.
…age (#19251)

C.register_tensorrt_plugins_as_custom_ops() is only available in the GPU python package. Add a condition to avoid calling it in the CPU python package.
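A minimal sketch of the kind of guard described; the wrapper function and the exact check are illustrative (the real change lives in the ORT python bindings):

```python
# Illustrative guard: only register TRT plugins when the binding actually
# exposes the call, i.e. in the GPU package. The wrapper name is hypothetical.
from onnxruntime.capi import _pybind_state as C

def maybe_register_tensorrt_plugins():
    if hasattr(C, "register_tensorrt_plugins_as_custom_ops"):
        C.register_tensorrt_plugins_as_custom_ops()
```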
TRT EP's GetTensorRTCustomOpDomainList() will create a vector of OrtCustomOpDomain objects and release the ownership of those objects, but those objects are never released afterwards. At the session level, we need to make TRT EP remember which OrtCustomOpDomain objects it created and release them at EP destruction time.
### Description
Since Cutlass can be built with CUDA 11.4 (the minimum CUDA version for the onnxruntime CUDA build), there is no need to have a flag to disable cutlass.

Changes:
(1) Reverted #18761
(2) Removed the condition to build cutlass.
(3) Fixed a few build errors or warnings found while testing the CUDA 11.4 build.

Note that SM 89 and 90 (including fp8) require CUDA 11.8 or later. Flash attention and cutlass fused multihead attention will not be built for CUDA < 11.6. It is recommended to build with CUDA 11.8 or above if you want to support the latest GPUs. It is better to include this in 1.17.0 (otherwise, the release branch might encounter a build failure with CUDA 11.4).

Tests:
(1) Build with flash attention and efficient attention off: **passed**
(2) Build with CUDA 11.4: **passed**

Example build command used on Ubuntu 20.04:
```
export CUDA_HOME=/usr/local/cuda-11.4
export CUDNN_HOME=/usr/lib/x86_64-linux-gnu/
export CUDACXX=/usr/local/cuda-11.4/bin/nvcc

sh build.sh --config Release --build_shared_lib --parallel --use_cuda --cuda_version 11.4 \
   --cuda_home $CUDA_HOME --cudnn_home $CUDNN_HOME --build_wheel --skip_tests \
   --cmake_extra_defines CMAKE_CUDA_ARCHITECTURES=80 \
   --disable_types float8
```
### Description
Update abseil to a release tag and register neural_speed with CG.

### Motivation and Context
We are currently using a non-released version of abseil. Using a tag is better.
### Description
[ORT 1.17.0 Release] Cherry pick 1st round

PR authors, please take a look and let me know if there are any questions about the changes, or approve accordingly.

---------

Co-authored-by: wejoncy <[email protected]>
Co-authored-by: Xavier Dupré <[email protected]>
Co-authored-by: Yulong Wang <[email protected]>
Co-authored-by: Hector Li <[email protected]>
Co-authored-by: luoyu-intel <[email protected]>
Co-authored-by: kunal-vaishnavi <[email protected]>
Co-authored-by: Chi Lo <[email protected]>
Co-authored-by: Ye Wang <[email protected]>
Co-authored-by: Adrian Lizarraga <[email protected]>
Co-authored-by: snadampal <[email protected]>
Co-authored-by: Tianlei Wu <[email protected]>
Co-authored-by: Heflin Stephen Raj <[email protected]>
Co-authored-by: Yifan Li <[email protected]>
Co-authored-by: Yufeng Li <[email protected]>
Co-authored-by: Changming Sun <[email protected]>